17 - Deep Learning - Plain Version 2020 [ID:21110]

Welcome back to deep learning. Today we want to talk about regularization techniques, and we start with a short introduction to regularization and the general problem of overfitting. We will first cover the background and ask which problem regularization is actually supposed to solve. Then we want to talk about classical techniques: normalization, initialization, transfer learning, and multitask learning. So why are we talking about this topic so much?

Well, if you want to fit your data, then problems like this one would be easy to fit, as they have a clear solution. Typically, however, you have the problem that your data is noisy and you cannot easily separate the classes. What you then run into is underfitting: if your model does not have a very high capacity, you may end up with something like this line here, which is not a very good fit to describe the separation of the classes. The opposite is overfitting. Here we have models with very high capacity that try to model everything they observe in the training data, which may yield decision boundaries that are not very reasonable. What we are actually interested in is a sensible decision boundary that is somehow a compromise between the observed data and their actual distribution. We can analyze this problem with the so-called bias-variance decomposition.
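Before we formalize this, here is a minimal sketch of how a low-capacity model underfits and a high-capacity model overfits the same noisy data. It is not taken from the lecture: it assumes NumPy and scikit-learn are available, and the target function, noise level, and polynomial degrees are arbitrary illustrative choices.

    # Under-/overfitting sketch on a hypothetical 1-D toy regression problem.
    # Assumptions: NumPy and scikit-learn; the target function, noise level,
    # and polynomial degrees are illustrative choices, not from the lecture.
    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error

    rng = np.random.default_rng(0)

    def h(x):
        # "ideal" function that generates the data
        return np.sin(2 * np.pi * x)

    # small, noisy training set and a clean test grid
    x_train = rng.uniform(0, 1, 20)
    y_train = h(x_train) + rng.normal(0, 0.3, x_train.shape)
    x_test = np.linspace(0, 1, 200)
    y_test = h(x_test)

    for degree in (1, 4, 15):  # low, moderate, and high capacity
        model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        model.fit(x_train[:, None], y_train)
        train_err = mean_squared_error(y_train, model.predict(x_train[:, None]))
        test_err = mean_squared_error(y_test, model.predict(x_test[:, None]))
        print(f"degree {degree:2d}: train MSE {train_err:.3f}, test MSE {test_err:.3f}")

With only 20 noisy points, the degree-1 model typically underfits (high error on both sets), while the degree-15 model drives the training error towards zero but generalizes poorly.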

Here we stick to regression problems, where we have an ideal function h(x) that computes some value and is typically associated with some measurement noise. So there is an additional value epsilon added to h(x), which we may assume to be normally distributed with zero mean and standard deviation sigma. Now you can go ahead and use a model to estimate h. This estimate is denoted as f hat and is computed from some data set D. We can then express the loss for a single point as the expected value of the loss, which here is simply the L2 loss: we take the true function minus the estimated function, square the difference, and compute the expected value.
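Written out, the setup just described amounts to the following; this is a reconstruction from the verbal description, so the symbols on the actual slides may differ slightly.

    % Data model and point-wise expected L2 loss (reconstructed from the
    % verbal description above; slide notation may differ slightly).
    \[
      y = h(x) + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^{2})
    \]
    \[
      \mathcal{L}(x) \;=\; \mathbb{E}\!\left[\big(y - \hat{f}(x; \mathcal{D})\big)^{2}\right]
    \]

Here the expectation runs over the measurement noise and over the training sets D used to estimate f hat.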

Interestingly, this loss can be shown to be decomposable into two parts. The first is the bias, which is essentially the deviation of the expected value of our model from the true model, so it measures how far we are off the ground truth. The other part can be explained by the limited size of the training data set: we can always try to find a very flexible model that reduces the bias, but what we get as a result is an increase in variance. The variance is the expected squared deviation of y hat from its own expected value, which is nothing else than the variance we encounter in y hat across training sets. Then, of course, there is also a small irreducible error. If we now integrate this over every data point x, we get the loss for the entire training data set.
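For reference, the decomposition that was just described in words can be written as follows; this is again a reconstruction from the transcript rather than a copy of the slide.

    % Bias-variance decomposition of the point-wise expected loss
    % (reconstructed from the verbal description above).
    \[
      \mathbb{E}\!\left[\big(y - \hat{f}(x; \mathcal{D})\big)^{2}\right]
      = \underbrace{\big(\mathbb{E}_{\mathcal{D}}[\hat{f}(x; \mathcal{D})] - h(x)\big)^{2}}_{\text{bias}^{2}}
      + \underbrace{\mathbb{E}_{\mathcal{D}}\!\left[\big(\hat{f}(x; \mathcal{D}) - \mathbb{E}_{\mathcal{D}}[\hat{f}(x; \mathcal{D})]\big)^{2}\right]}_{\text{variance}}
      + \underbrace{\sigma^{2}}_{\text{irreducible error}}
    \]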

By the way, a similar decomposition exists for classification using the 0-1 loss, which you can see in reference number 9. It is slightly different, but it has similar implications. So we learn that by allowing an increase in variance we can essentially reduce the bias, which means the prediction error of our model on the training data set.

Let's visualize this a bit. On the top left, we see a low-bias, low-variance model: it is essentially always right, and there is not a lot of noise in the predictions. On the top right, we see a high-bias, low-variance model: it is very consistent, but consistently off. On the bottom left, we see a low-bias, high-variance model: it has a considerable degree of variation, but on average it is very close to where it is supposed to be. On the bottom right, we have the case that we want to avoid, a high-bias, high-variance model, which has lots of noise and is not even centered on where it is supposed to be. So we can choose the type of model for a given data set, but optimizing bias and variance simultaneously is in general impossible.
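One way to see this trade-off numerically is to refit the same model class on many freshly drawn training sets and to estimate bias and variance from the spread of the resulting predictions. The sketch below reuses the hypothetical toy problem from above; the number of repetitions and the evaluation point are again arbitrary illustrative choices.

    # Empirical bias/variance estimate at a single point x0, obtained by
    # refitting the same model class on many freshly drawn training sets.
    # Assumptions: NumPy and scikit-learn; h, sigma, the degrees, and x0
    # are hypothetical illustrative choices.
    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    sigma, n_train, n_repeats, x0 = 0.3, 20, 500, 0.25

    def h(x):
        return np.sin(2 * np.pi * x)  # "ideal" function

    for degree in (1, 4, 15):
        preds = []
        for _ in range(n_repeats):
            x = rng.uniform(0, 1, n_train)
            y = h(x) + rng.normal(0, sigma, n_train)
            model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
            model.fit(x[:, None], y)
            preds.append(model.predict(np.array([[x0]]))[0])
        preds = np.array(preds)
        bias2 = (preds.mean() - h(x0)) ** 2   # squared bias at x0
        variance = preds.var()                # variance at x0
        print(f"degree {degree:2d}: bias^2 {bias2:.4f}, variance {variance:.4f}")

On this toy problem the low-degree model typically shows the largest squared bias and the high-degree model the largest variance, which is exactly the trade-off described above.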

Bias and variance can be studied together in terms of model capacity, which we will take a look at on the next slide. The capacity of a model describes the variety of functions it can approximate. It is related to the number of parameters, so people often say that if you increase the number of parameters you can get rid of your bias. That is roughly true, but the number of parameters is by far not the same thing as capacity. To be exact, you need to compute the Vapnik-Chervonenkis (VC) dimension. The VC dimension is an exact measure of capacity, and it is based on counting the largest number of points that the model can shatter, i.e., separate correctly under every possible labeling.
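As a concrete example of this counting argument (a standard textbook result, not taken from the lecture): linear classifiers with a bias term in d dimensions have VC dimension d + 1.

    % VC dimension of linear classifiers (halfspaces with bias) in R^d;
    % standard textbook result, stated here only for illustration.
    \[
      \mathrm{VCdim}\big(\{\, x \mapsto \operatorname{sign}(w^{\top} x + b) \,\}\big) = d + 1,
      \qquad x, w \in \mathbb{R}^{d},\; b \in \mathbb{R}.
    \]

In the plane (d = 2) this means that three points in general position can be separated by a line under every one of the 2^3 = 8 possible labelings, but no set of four points can be shattered; for four points in convex position, for example, the labeling that pairs up the two diagonals already fails.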

So the VC dimension of neural networks is extremely high compared to classical methods, and they have a very high model capacity. They even manage to memorize random labels, if you remember reference number 18; that is again the paper that looked into learning random labels on ImageNet. The VC dimension is …

Part of a video series
Access: Open Access
Duration: 00:11:02 min
Recorded: 2020-10-12
Uploaded: 2020-10-12 15:06:17
Language: en-US

Deep Learning - Regularization Part 1

This video discusses the problem of over- and underfitting. In order to get a better understanding, we explore the bias-variance trade-off and look into the effects of training data size and the number of parameters.

For reminders to watch the new video, follow on Twitter or LinkedIn.

Further Reading:
A gentle Introduction to Deep Learning
